Fundamental Techniques in Data Science with R

Very useful

Overview of this course

Program

Week # | Topic | R-practical | Workgroup
1 | The elemental building blocks of R | Assigning objects and elements; creating vectors, matrices, dataframes and lists | Receive instructions and form groups
2 | Finding the least squares solution; simple linear regression | Subsetting data; using pipes to simplify the workflow | Locate a data set for predictive modeling and formulate a research hypothesis; make sure that the set facilitates continuous and dichotomous outcomes
3 | Linear modeling in R; testing assumptions; standardized residuals, leverage and Cook’s distance | Class lm in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met
4 | Inferential modeling; confidence intervals and hypothesis testing; non-constant error variance | Demonstrate confidence validity of the linear model on simulated data with rmarkdown | Test and quantify the effect of the defined model; continue the project in rmarkdown
5 | Model evaluation; cross-validation; categorical variables, non-linear relations, interactions and higher-order polynomials | Cross-validation and model fit in R | Evaluate if the model can be improved; prepare assignment A; evaluate the final linear model on your own data

Program

Week # | Topic | R-practical | Workgroup
6 | Simple logistic regression | Class glm(formula, family = "binomial") in R; modeling, prediction and visualization | Fit your defined model; evaluate if assumptions are met
7 | Formulating the logistic model and interpreting the parameters; marginal effects | Parameter transformations; scale of the predictor/outcome and prediction and confidence intervals | Test and quantify the effect of the defined model
8 | Logistic regression model evaluation; cross-validation; multiple regression; interactions | Multiple logistic regression and cross-validating the logistic regression in R | Evaluate if the model can be improved; prepare assignment B; evaluate the final logistic model on your own data

Goal of this course

Formal Goals

  1. apply and interpret the basic methodological and statistical concepts that are associated with doing predictive and/or inferential research;
  2. apply and interpret important techniques in linear and logistic regression analysis;

This means that you will learn the ins and outs of inferential and predictive research with linear and logistic models.

  • what this all covers will become clear during the course
  • we will learn R to perform our data analysis and visualizations
  • we will learn the math and skills behind these ubiquitous modeling techniques
  • we will also learn the assumptions of (logistic) regression models

Real-world Goals

Learn to keep your cool

and build the foundation for a successful scripting career in predictive and inferential analytics

What is R?

Software

The origin of R

  • R is a language and environment for statistical computing and for graphics

  • GNU project (100% free software)

  • Managed by the R Foundation for Statistical Computing, Vienna, Austria.

  • Community-driven

  • Based on the object-oriented language S (1975)

What is RStudio?

Integrated Development Environment

RStudio

  • Aggregates all convenient information and procedures into one single place
  • Allows you to work in projects
  • Manages your code with highlighting
  • Gives extra functionality (Shiny, knitr, markdown, LaTeX)
  • Allows for integration with version control routines, such as Git.

How does R work

Objects and elements

  • R works with objects that consist of elements. The smallest elements are numbers and characters.

    • These elements are assigned to objects.
    • A set of objects can be used to perform calculations
    • Calculations can be presented as functions
    • Functions are used to perform calculations and return new objects, containing calculated (or estimated) elements.
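
As a minimal sketch of this flow (the object names x, total and avg are only illustrative):

```r
# Assign elements to an object
x <- c(2, 4, 6)

# Functions take objects as input and return new objects
total <- sum(x)
avg <- mean(x)

# The results are objects themselves and can be used in further calculations
total / avg
```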

The help

  • Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users, must be accompanied by a help file.
  • If you know the name of the function that performs an operation, e.g. anova(), then you just type ?anova or help(anova) in the console.
  • If you do not know the name of the function: type ?? followed by your search criterion. For example ??anova returns a list of all help pages that contain the word ‘anova’

  • Alternatively, the internet will tell you almost everything you’d like to know and sites such as http://www.stackoverflow.com and http://www.stackexchange.com, as well as Google can be of tremendous help.
    • If you google R related issues; use ‘R:’ as a prefix in your search term

Assigning elements to objects

  • Assigning things in R is very straightforward:

    • you just use <-
  • For example, if you assign the value 100 (an element) to object a, you would type

a <- 100

Calling objects

  • Calling things in R is also very straightforward:

    • you just type the name you have given to the object
  • For example, we assigned the value 100 to object a. To call object a, we would type

a
## [1] 100

Writing code

This is why we use RStudio.

Objects that contain more than one element

More than one element

  • We can assign more than one element to a vector (in this case a 1-dimensional concatenation of numbers 1 through 5)
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5
b <- 1:5
b
## [1] 1 2 3 4 5

More than one element, with characters

Characters (or character strings) in R are indicated by the double quote identifier.

a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"

Notice the difference with a from the previous slide

a
## [1] 1 2 3 4 5

Quickly identifying elements in vectors

rep(a, 15)
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [36] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
## [71] 1 2 3 4 5

Calling elements in vectors

If we want just the third element, we would type

a[3]
## [1] 3

Multiple vectors in one object

This we would refer to as a matrix

c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

Calling elements in matrices #1

  • The first row is called by
c[1, ]
## [1] 1 1
  • The second column is called by
c[, 2]
## [1] 1 2 3 4 5

Calling elements in matrices #2

  • The intersection of the first row and second column is called by
c[1, 2]
## [1] 1

In short: square brackets [] are used to call elements, rows, columns (and much more beyond the scope of this course)

Matrices with mixed numeric / character data

If we add a character column to matrix c, everything becomes a character:

cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Matrices with mixed numeric / character data

Alternatively,

cbind(c, c("a", "b", "c", "d", "e"))
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"

Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
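
A small sketch of what this means in practice. The names m.num and m.chr are illustrative and stand in for matrix c and its character version:

```r
m.num <- matrix(1:10, nrow = 5, ncol = 2)
m.chr <- cbind(m.num, letters[1:5])

is.numeric(m.num)    # the original matrix is numeric
is.character(m.chr)  # after adding letters, every element is a character

# Column sums work on the numeric matrix ...
colSums(m.num)

# ... but the character columns must be converted back to numbers first
colSums(apply(m.chr[, 1:2], 2, as.numeric))
```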

Data frames

d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2), 
                "V3" = letters[1:5])
d
##            V1       V2 V3
## 1  0.08277864 5.849016  a
## 2  0.54542935 3.292187  b
## 3  0.19208169 2.294143  c
## 4 -0.65761705 2.523033  d
## 5 -0.85008235 4.005873  e

We ‘filled’ a dataframe with two randomly generated sets from the normal distribution (\(V1\) is standard normal and \(V2\) has mean 5 and standard deviation 2, i.e. \(V2 \sim N(5, 2^2)\)) and a character set.

Data frames (continued)

Data frames can contain both numerical and character elements at the same time, although never in the same column.

You can name the columns and rows in data frames (just like in matrices)

row.names(d) <- c("row 1", "row 2", "row 3", "row 4", "row 5")
d
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e

Calling row elements in data frames

There are two ways to obtain row 3 from data frame d:

d["row 3", ]
##              V1       V2 V3
## row 3 0.1920817 2.294143  c

and

d[3, ]
##              V1       V2 V3
## row 3 0.1920817 2.294143  c

The intersection between row 2 and column 3 can be obtained by

d[2, 3]
## [1] b
## Levels: a b c d e
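
The Levels line in this output shows that column V3 is stored as a factor. That was the default behaviour of data.frame() before R 4.0.0; from R 4.0.0 onward character columns stay characters unless stringsAsFactors = TRUE is set. A minimal sketch of the difference, with made-up data frames d.chr and d.fct:

```r
d.chr <- data.frame(V1 = 1:3, V3 = letters[1:3], stringsAsFactors = FALSE)
d.fct <- data.frame(V1 = 1:3, V3 = letters[1:3], stringsAsFactors = TRUE)

class(d.chr$V3)   # character column
class(d.fct$V3)   # factor column

d.chr[2, 2]       # prints the character value only
d.fct[2, 2]       # prints the value together with the factor levels
```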

Calling column elements in data frames

Both

d[, "V2"] # and
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873
d[, 2]
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873

yield the second column. But we can also use $ to call variable names in data frame objects

d$V2
## [1] 5.849016 3.292187 2.294143 2.523033 4.005873

Beyond two dimensions

If you wish to use numerical objects that have more than two dimensions, an array would be a suitable object. The following code yields a 3-dimensional array (2 rows, 4 columns and 3 matrices):

e <- array(1:24, dim = c(2, 4, 3))
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    9   11   13   15
## [2,]   10   12   14   16
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,]   17   19   21   23
## [2,]   18   20   22   24

Indexing an array

The square bracket identification works similarly to the identification of matrices and dataframes, but with the added dimension(s). For example,

e[1, 3, 2]
## [1] 13

yields the element in the first row of the third column in the second matrix. This is exactly the downside to an array: it is a series of matrices.

In other words, characters and numerical elements may not be mixed.

Potential problem with array

If we replace the third matrix in the array by a character version of that matrix, we obtain

e[, , 3] <- as.character(e[, , 3])
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "3"  "5"  "7" 
## [2,] "2"  "4"  "6"  "8" 
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,] "9"  "11" "13" "15"
## [2,] "10" "12" "14" "16"
## 
## , , 3
## 
##      [,1] [,2] [,3] [,4]
## [1,] "17" "19" "21" "23"
## [2,] "18" "20" "22" "24"

Solution: a list

Lists are just what the name says they are: lists. You can have a list of everything mixed with everything. For example, a simple list can be created by

f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5

Elements or objects within lists can be called by using double square brackets [[]]. For example, the first (and only) element in list f is object a

f[[1]]
## [1] 1 2 3 4 5

Lists (continued)

We can simply add an object or element to an existing list

f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e

to obtain a list with a vector and a data frame.

Lists (continued)

We can add names to the list as follows

names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##                V1       V2 V3
## row 1  0.08277864 5.849016  a
## row 2  0.54542935 3.292187  b
## row 3  0.19208169 2.294143  c
## row 4 -0.65761705 2.523033  d
## row 5 -0.85008235 4.005873  e

Calling elements in lists

Calling the vector (a) from the list can be done as follows

f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5

Lists in lists

Take the following example

g <- list(f, f)

To call the vector from the second list within the list g, use the following code

g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5

Logical operators

  • Logical operators are signs that evaluate a statement, such as ==, <, >, <=, >=, and | (OR) as well as & (AND). Typing ! before a logical operator takes the complement of that action. There are more operations, but these are the most useful.

  • For example, if we would like elements out of matrix c that are larger than 3, we would type:

c[c > 3]
## [1] 4 5 4 5

Why does a logical statement on a matrix return a vector?

c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE

The number of TRUE values can differ between columns, so the selected elements do not in general fit a rectangular matrix. A vector as a return is therefore more appropriate.
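
If the positions of the selected elements are needed as well, which() with arr.ind = TRUE reports them. A short sketch; the matrix is called m here, instead of c, purely for illustration:

```r
m <- matrix(1:5, nrow = 5, ncol = 2)

# Logical matrix with the same shape as m
m > 3

# Subsetting with it drops the shape and returns the matching elements
m[m > 3]

# which() with arr.ind = TRUE recovers the row and column of each match
which(m > 3, arr.ind = TRUE)
```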

Logical operators (cont’d)

  • If we would like the elements that are smaller than 3 OR larger than 3, we could type
c[c < 3 | c > 3] #c smaller than 3 or larger than 3
## [1] 1 2 4 5 1 2 4 5

or

c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5

Logical operators (cont’d)

  • In fact, c != 3 returns a matrix
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
  • Remember c?:
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

Things that cannot be done

  • Things that have no representation in real number space (at least not without tremendous effort)
    • For example, the following code returns “Not a Number”
0 / 0
## [1] NaN
  • Also impossible are calculations based on missing values (NA’s)
mean(c(1, 2, NA, 4, 5))
## [1] NA

Standard solutions for missing values

There are two easy ways to perform “listwise deletion”:

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
mean(na.omit(c(1, 2, NA, 4, 5)))
## [1] 3
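
For a data frame, listwise deletion removes every row that has at least one missing value. A minimal sketch with a made-up data frame dat:

```r
dat <- data.frame(x = c(1, 2, NA, 4, 5),
                  y = c(10, NA, 30, 40, 50))

# is.na() flags the missing elements
is.na(dat$x)

# na.omit() on a data frame drops every row with at least one NA
dat.complete <- na.omit(dat)
nrow(dat.complete)

# complete.cases() gives the same selection as a logical vector
dat[complete.cases(dat), ]

colMeans(dat.complete)
```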

Floating point example

(3 - 2.9)
## [1] 0.1
(3 - 2.9) <= 0.1
## [1] FALSE

Why does R tell us that 3 - 2.9 \(\neq\) 0.1?

(3 - 2.9) - .1
## [1] 8.326673e-17
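
Neither 2.9 nor 0.1 can be stored exactly in binary floating point, so the result of the subtraction is off by a tiny amount. A common way around this is to compare up to a tolerance, for example with all.equal(). A sketch; the tolerance 1e-8 in the manual version is an arbitrary choice:

```r
x <- 3 - 2.9

x == 0.1                    # exact comparison fails
isTRUE(all.equal(x, 0.1))   # comparison up to a small tolerance succeeds

# A manual version with an explicit tolerance
abs(x - 0.1) < 1e-8
```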

Some programming tips:

  • keep your code tidy
  • use comments (text preceded by #) to clarify what you are doing
    • If you look at your code again one month from now, you will not know what you did, unless you use comments
  • when working with functions, use the TAB key to quickly access the help for the function’s components
  • work with logically named R-scripts
    • indicate the sequential nature of your work
  • work with RStudio projects
  • if allowed, place your project folders in some cloud-based environment

The blueprint of R

Layers in R

There are several ‘layers’ in R. Some layers you are allowed to fiddle around in, some are forbidden. In general there is the following distinction:

  • The global environment.
  • User environments
  • Functions
  • Packages
  • Namespaces

Environments

The global environment can be seen as an Olympic-size swimming pool. Everything you do has its place there.

If you’d like, you may create another, separate environment to work in.

  • Objects in a user environment are kept separate from the global environment: they are only found when you refer to that environment explicitly
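
A minimal sketch of working with a separate environment, using the base functions new.env(), assign(), get() and exists(). The name my.env is only illustrative:

```r
my.env <- new.env()

# Create an object inside the user environment
assign("a", 100, envir = my.env)

# The object is stored in my.env itself
exists("a", envir = my.env, inherits = FALSE)
get("a", envir = my.env)
ls(my.env)
```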

Functions

  • If you create a function, it is positioned in the global environment.

  • Everything that happens in a function, stays in a function. Unless you specifically tell the function to share the information with the global environment.

  • See functions as a shampoo bottle in a swimming pool to which you add some water. If you’d like to see the color of the mixture, you’d have to squeeze the bottle for it to come out.
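
The shampoo bottle can be sketched as follows. The function name AddOne and the object names are hypothetical; the point is that result only exists inside the function, and that we only see its value because the function returns it:

```r
AddOne <- function(x) {
  result <- x + 1   # 'result' exists only while the function runs
  result            # the last value is returned to the caller
}

out <- AddOne(1)
out

# 'result' was never created outside the function
exists("result", inherits = FALSE)
```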

Packages

  • Packages have their own space.

    • Everything needed to run the functions in a package is neatly contained within its own space
    • See packages as separate (mini) pools that are connected to the main pool (the global environment)

Loading packages

There are two ways to load a package in R

library(stats)

and

require(stats)

require() returns FALSE and produces a warning when a package is not found. In other words, it does not stop with an error as function library() does.
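
The difference can be seen with a package that is not installed. In this sketch the package name is made up and assumed not to exist on your system:

```r
# Hypothetical package name that is assumed not to be installed
pkg <- "aPackageThatDoesNotExist"

# require() warns and returns FALSE, so the script continues
loaded <- suppressWarnings(require(pkg, character.only = TRUE))
loaded

# library() throws an error, which we catch here only to show the message
msg <- tryCatch(library(pkg, character.only = TRUE),
                error = function(e) conditionMessage(e))
msg
```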

Installing packages

The easiest way to install e.g. package mice is to use

install.packages("mice")

Alternatively, you can also do it in RStudio through

Tools --> Install Packages

Namespaces

  • Namespaces. These are the deeper layers that feed new water to the surface of the mini pools.
    • Packages can have namespaces.
    • Functions within packages are executed within the package or namespace and have access to the global environment.
    • Objects in the global environment that match objects in the function’s namespace are ignored when running functions from packages!
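
A small sketch of the last point. We deliberately create our own var() in the working environment; sd() from package stats calls var() internally, but it keeps using the version from its own namespace. The replacement function is purely illustrative:

```r
# A user-defined object that shares its name with var() from package stats
var <- function(x, ...) "not a variance"

own.result <- var(c(1, 2, 3))           # our own version is found first
stats.result <- stats::var(c(1, 2, 3))  # :: calls the version in stats

# sd() lives in stats and ignores our var(); it still uses stats::var()
sd.result <- sd(c(1, 2, 3))

own.result
stats.result
sd.result

rm(var)  # remove our version again
```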

R in depth

Workspaces and why you should sometimes save them

A workspace contains all changes you made to environments, functions and namespaces.

A saved workspace contains everything as it was at the moment of saving.

You do not need to run all the previous code again if you would like to continue working at a later time.

  • You can save the workspace and continue exactly where you left off.

Workspaces are compressed and require relatively little disk space when stored. The compression is very efficient and beats reloading large datasets from raw text.
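
A minimal sketch of saving and restoring objects. It uses save() and load() with a temporary file; save.image() would store the complete workspace in the same format. The file and object names are illustrative:

```r
a <- 100
b <- 1:5

# Save selected objects to a temporary file
ws.file <- tempfile(fileext = ".RData")
save(a, b, file = ws.file)

# Load them into a fresh environment to show that everything is restored
restored <- new.env()
load(ws.file, envir = restored)
ls(restored)
identical(get("b", envir = restored), b)
```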

History and why it is useful

R by default saves (part of) the code history and RStudio expands this functionality greatly.

It is often useful to look back at the code history.

  • There are multiple ways to access the code history.

    1. Use arrow up in the console. This allows you to go back in time, one line of code at a time. Extremely useful to go back to previous lines for minor alterations to the code.
    2. Use the history tab in the environment pane. The complete project history can be found here and the history can be searched. This is particularly convenient when you know what code you are looking for.

Working in projects in RStudio

  • Every project has its own history
  • Every research project has its own project
  • Every project can have its own folder, which also serves as a research archive
  • Every project can have its own version control system
  • RStudio projects can relate to Git (or other online) repositories

Modeling

Modeling in R

To model objects based on other objects, we use ~ (tilde)

  • For example, to model body mass index (BMI) on weight, we would type
BMI ~ weight

Tilde is used to separate the left- and right-hand sides in a model formula.

For functions (or models) within models we use I()

  • For example, to model body mass index (BMI) on its deterministic function of weight and height, we would type

BMI ~ I(weight / height^2)
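
I() is needed because operators such as / and ^ have a special model meaning inside a formula; wrapped in I() they are treated as ordinary arithmetic. The sketch below uses simulated data (all names and values are made up) to show that a formula is an object in its own right and that the I() term counts as a single predictor:

```r
# Simulated data; the variable names and values are made up for illustration
set.seed(123)
dat <- data.frame(weight = runif(20, 50, 90),    # kg
                  height = runif(20, 1.5, 2.0))  # m
dat$BMI <- dat$weight / dat$height^2

# A formula can be stored and inspected like any other object
f <- BMI ~ I(weight / height^2)
class(f)
all.vars(f)

# Inside I(), / and ^ are ordinary arithmetic, so the model has one predictor
fit.bmi <- lm(f, data = dat)
length(coef(fit.bmi))
```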

Example

Remember the boys data from package mice:

lm(bmi ~ wgt, data = boys)
## 
## Call:
## lm(formula = bmi ~ wgt, data = boys)
## 
## Coefficients:
## (Intercept)          wgt  
##     14.5401       0.0935

Example continued

Remember the boys data from package mice:

lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)
## 
## Call:
## lm(formula = bmi ~ I(wgt/(hgt/100)^2), data = boys)
## 
## Coefficients:
##        (Intercept)  I(wgt/(hgt/100)^2)  
##          -0.005553            1.000034

More efficient

It is ‘nicer’ to store the output from the function in an object. The convention for regression models is an object called fit.

fit <- lm(bmi ~ I(wgt / (hgt / 100)^2), data = boys)

The object fit contains a lot more than just the regression weights. To inspect what is inside you can use

ls(fit)
##  [1] "assign"        "call"          "coefficients"  "df.residual"  
##  [5] "effects"       "fitted.values" "model"         "na.action"    
##  [9] "qr"            "rank"          "residuals"     "terms"        
## [13] "xlevels"

Inspecting what is inside fit

Another approach to inspecting the contents of fit is the function attributes()

attributes(fit)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

The benefit of using attributes() is that it directly tells you the class of the object.
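
Once you know what is inside the fitted object, you can pull out individual elements, either directly with $ or through extractor functions. A sketch on the cars data that ships with R, which stands in here for the boys data; fit.cars is an illustrative name:

```r
# 'cars' ships with R; it stands in here for the boys data
fit.cars <- lm(dist ~ speed, data = cars)

names(fit.cars)            # the elements stored in the fitted object
fit.cars$coefficients      # direct access with $
coef(fit.cars)             # the same through an extractor function
head(residuals(fit.cars))  # first few residuals
class(fit.cars)
```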

Classes in R

class(fit)
## [1] "lm"

Classes are used for an object-oriented style of programming. This means that you can write a specific function that

  • has fixed requirements with respect to the input.
  • presents output or graphs in a predefined manner.

When a generic function fun is applied to an object with class attribute c("first", "second"), the system searches for a function called fun.first and, if it finds it, applies it to the object.

If no such function is found, a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used (if it exists). If there is no class attribute, the implicit class is tried, then the default method.
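
The search order described above can be tried out with a toy generic. Everything in this sketch (the generic fun, its methods and the class names) is hypothetical and only mirrors the names used in the text:

```r
# A generic function: UseMethod() performs the dispatch on the class attribute
fun <- function(x, ...) UseMethod("fun")

fun.first   <- function(x, ...) "method for class 'first'"
fun.second  <- function(x, ...) "method for class 'second'"
fun.default <- function(x, ...) "default method"

obj <- structure(list(), class = c("first", "second"))
fun(obj)   # fun.first is found and used

class(obj) <- c("unknown", "second")
fun(obj)   # there is no fun.unknown, so fun.second is tried

fun(1:3)   # no matching class method, so fun.default is used
```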

Classes example: plotting without class

plot(bmi ~ wgt, data = boys)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 1)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 2)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 3)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 4)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 5)

Classes example: plotting with class

plot(lm(bmi ~ wgt, data = boys), which = 6)

Why is plot different for class "lm"?

The function plot() is called, but not used. Instead, because the linear model has class "lm", R searches for the function plot.lm().

If function plot.lm() did not exist, R would fall back on the default method plot.default() (which would have failed in this case because it requires x and y as input)

plot.lm() is created by John Maindonald and Martin Maechler. They thought it would be useful to have a standard plotting environment for objects with class "lm".

Since the elements that an object of class "lm" contains are known, writing a class-specific method for a generic function is straightforward.

R-coding

The Google style guide

Naming conventions

File Names

File names should end in .R and, of course, be meaningful.

GOOD:

predict_ad_revenue.R

BAD:

foo.R

Identifiers

Don’t use underscores ( _ ) or hyphens ( - ) in identifiers. Identifiers should be named according to the following conventions.

  1. The preferred form for variable names is all lower case letters and words separated with dots (variable.name), but variableName is also accepted;
  2. function names have initial capital letters and no dots (FunctionName);
  3. constants are named like functions but with an initial k.

Identifiers (continued)

  • variable.name is preferred, variableName is accepted
    GOOD: avg.clicks
    OK: avgClicks
    BAD: avg_Clicks

  • FunctionName
    GOOD: CalculateAvgClicks
    BAD: calculate_avg_clicks , calculateAvgClicks
  • kConstantName
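
Put together, the three conventions look like this. All names in the sketch are made up for illustration:

```r
# Constant: named like a function, with an initial k
kSecondsPerMinute <- 60

# Function: initial capital letters, no dots
ConvertToMinutes <- function(seconds) {
  seconds / kSecondsPerMinute
}

# Variable: lower case letters, words separated by dots
elapsed.seconds <- 150
elapsed.minutes <- ConvertToMinutes(elapsed.seconds)
elapsed.minutes
```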

Syntax

Line Length

The maximum line length is 80 characters.

# This is to demonstrate that at about eighty characters you would move off of the page

# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)

# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt 
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
#or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg 
          + bmi * hgt 
          + bmi * wgt
          + wgt * hgt 
          + wgt * hgt * bmi, 
          data = boys)

Indentation

When indenting your code, use two spaces. RStudio does this for you!

Never use tabs or mix tabs and spaces.

Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.

Spacing

Place spaces around all binary operators (=, +, -, <-, etc.).

Exception: Spaces around =’s are optional when passing parameters in a function call.

lm(age ~ bmi, data=boys)

or

lm(age ~ bmi, data = boys)

Spacing (continued)

Do not place a space before a comma, but always place one after a comma.

GOOD:

tab.prior <- table(df[df$days.from.opt < 0, "campaign.id"])
total <- sum(x[, 1])
total <- sum(x[1, ])

Spacing (continued)

BAD:

# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])  
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])  
# Needs a space before <-
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"]) 
# Needs spaces around <-
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])  
# Needs a space after the comma
total <- sum(x[,1])  
# Needs a space after the comma, not before 
total <- sum(x[ ,1])  

Spacing (continued)

Place a space before a left parenthesis, except in a function call.

GOOD:

if (debug)

BAD:

if(debug)

Extra spacing

Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).

plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))

Do not place spaces around code in parentheses or square brackets.

Exception: Always place a space after a comma.

Extra spacing

GOOD:

if (debug)
x[1, ]

BAD:

if ( debug )  # No spaces around debug
x[1,]  # Needs a space after the comma 

In general…

  • Use common sense and BE CONSISTENT.

  • The point of having style guidelines is to have a common vocabulary of coding
    • so people can concentrate on what you are saying, rather than on how you are saying it.
  • If code you add to a file looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.